not been explored. In particular, a celebrated result of Bourgain and Tzafriri demonstrates that
each  matrix  with  normalized  columns  contains  a  large  column  submatrix  that  is  exceptionally  well 
conditioned.  Unfortunately,  standard  proofs  of  this  result  cannot  be  regarded  as  algorithmic. 

This  paper  presents  a  randomized,  polynomial-time  algorithm  that  produces  the  submatrix 
promised  by  Bourgain  and  Tzafriri.  The  method  involves  random  sampling  of  columns,  followed 
by a matrix factorization that exposes the well-conditioned subset of columns. This factorization,
1.  Introduction


Column  subset  selection  refers  to  the  challenge  of  extracting  from  a  matrix  a  column  submatrix 
that has some distinguished property. These properties commonly involve conditions on the spectrum
of the submatrix. The most familiar example is probably rank-revealing QR, which seeks a
well-conditioned collection of columns that spans the (numerical) range of the matrix [GE96].

The  literature  on  geometric  functional  analysis  contains  several  fundamental  theorems  on  column 
subset  selection  that  have  not  been  discussed  by  the  algorithms  community  or  the  numerical  linear 
algebra  community.  These  results  are  phrased  in  terms  of  the  stable  rank  of  a  matrix: 


\[ \mathrm{st.rank}(A) = \frac{\|A\|_F^2}{\|A\|^2}, \]

where $\|\cdot\|_F$ is the Frobenius norm and $\|\cdot\|$ is the spectral norm. The stable rank can be viewed as
an analytic surrogate for the algebraic rank. Indeed, express the two norms in terms of singular
values to obtain the relation

\[ \mathrm{st.rank}(A) \le \mathrm{rank}(A). \]

In  this  bound,  equality  occurs  (for  example)  when  the  columns  of  A  are  identical  or  when  the 
columns  of  A  are  orthonormal.  As  we  will  see,  the  stable  rank  is  tightly  connected  with  the 
number  of  (strongly)  linearly  independent  columns  we  can  extract  from  a  matrix. 
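
The singular-value argument behind this bound can be written out in one line. The following is a short worked derivation; the symbols $\sigma_1 \ge \dots \ge \sigma_r > 0$ (the nonzero singular values of $A$) and $r = \mathrm{rank}(A)$ are notation introduced only for this sketch.

```latex
\[
\mathrm{st.rank}(A)
  = \frac{\|A\|_F^2}{\|A\|^2}
  = \frac{\sum_{i=1}^{r} \sigma_i^2}{\sigma_1^2}
  \le \frac{r\,\sigma_1^2}{\sigma_1^2}
  = \mathrm{rank}(A).
\]
% Identical unit-norm columns: r = 1, so st.rank(A) = 1 = rank(A).
% n orthonormal columns: sigma_1 = ... = sigma_n = 1, so st.rank(A) = n = rank(A).
```

Equality holds exactly when all nonzero singular values coincide, which covers both extreme cases mentioned above.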

Before  we  continue,  let  us  instate  some  notation.  We  say  that  a  matrix  is  standardized  when 
its columns have unit $\ell_2$ norm. The $j$th column of a matrix $A$ is denoted by $a_j$. For a subset $\tau$
of column indices, we write $A_\tau$ for the column submatrix indexed by $\tau$. Likewise, given a square
matrix $H$, the notation $H_{\tau\times\tau}$ refers to the principal submatrix whose rows and columns are listed
in $\tau$. The pseudoinverse $D^\dagger$ of a diagonal matrix $D$ is formed by reciprocating the nonzero entries.
As usual, we write $\|\cdot\|_p$ for the $\ell_p$ vector norm. The condition number of a matrix is the quantity


\[ \kappa(A) = \max\left\{ \frac{\|Ax\|_2}{\|Ay\|_2} : \|x\|_2 = \|y\|_2 = 1 \right\}. \]


Finally,  upright  letters  (c,  C,K, ...)  refer  to  positive,  universal  constants  that  may  change  from 
appearance  to  appearance. 

The first theorem, due to Kashin and Tzafriri, shows that each matrix with standardized columns
contains a large column submatrix that has small spectral norm [Ver01, Thm. 2.5].


Theorem 1.1 (Kashin-Tzafriri). Suppose $A$ is standardized. Then there is a set $\tau$ of column
indices for which

\[ |\tau| \ge \mathrm{st.rank}(A) \quad\text{and}\quad \|A_\tau\| \le C. \]


In  fact,  much  more  is  true.  Combining  Theorem  1.1  with  the  celebrated  restricted  invertibility 
result  of  Bourgain  and  Tzafriri  [BT87,  Thm.  1.2],  we  find  that  every  standardized  matrix  contains 
a  large  column  submatrix  whose  condition  number  is  small. 

Theorem 1.2 (Bourgain-Tzafriri). Suppose $A$ is standardized. Then there is a set $\tau$ of column
indices for which

\[ |\tau| \ge c \cdot \mathrm{st.rank}(A) \quad\text{and}\quad \kappa(A_\tau) \le \sqrt{3}. \]

Theorem 1.2 yields the best general result [BT91, Thm. 1.1] on the Kadison-Singer conjecture,
a major open question in operator theory. To display its strength, let us consider two extreme
examples.

(1) When $A$ has identical columns, every collection of two or more columns is singular. Theorem 1.2
guarantees a well-conditioned submatrix $A_\tau$ with $|\tau| = 1$, which is optimal.

(2) When $A$ has $n$ orthonormal columns, the full matrix is perfectly conditioned. Theorem 1.2
guarantees a well-conditioned submatrix $A_\tau$ with $|\tau| \ge cn$, which lies within a constant
factor of optimal.

The stable rank allows Theorem 1.2 to interpolate between the two extremes. Subsequent research
established  that  the  stable  rank  is  intrinsic  to  the  problem  of  finding  well-conditioned  submatrices. 
We  postpone  a  more  detailed  discussion  of  this  point  until  Section  6. 

1.1.  Contributions. Although Theorems 1.1 and 1.2 would be very useful in computational applications,
we cannot regard current proofs as constructive. The goal of this paper is to establish
the following novel, algorithmic claim.

Theorem  1.3.  There  are  randomized,  polynomial-time  algorithms  for  producing  the  sets  guaranteed 
by  Theorem  1.1  and  by  Theorem  1.2. 

This  result  is  significant  because  no  known  algorithm  for  column  subset  selection  is  guaranteed 
to  produce  a  submatrix  whose  condition  number  has  constant  order.  See  [BDM08]  for  a  recent 
overview  of  that  literature.  The  present  work  has  other  ramifications  with  independent  interest. 

•  We develop algorithms for computing the matrix factorizations of Pietsch and Grothendieck,
which are regarded as basic instruments in modern functional analysis [Pis86].

•  The methods for computing these factorizations lead to new approximation algorithms for
two NP-hard matrix norms. (See Remarks 3.2 and 5.6.)

•  We identify an intriguing connection between Pietsch factorization and the MAXCUT semidefinite
program [GW95].

1.2.  Overview.  We  focus  on  the  algorithmic  version  of  the  Kashin-Tzafriri  theorem  because  it 
highlights  all  the  essential  concepts  while  minimizing  irrelevant  details.  Section  2  outlines  a  proof  of 
this  result,  emphasizing  where  new  algorithmic  machinery  is  required.  The  missing  link  turns  out 
to  be  a  computational  method  for  producing  a  certain  matrix  factorization.  Section  3  reformulates 
the  factorization  problem  as  an  eigenvalue  minimization,  which  can  be  completed  with  standard 
techniques.  In  Section  4,  we  exhibit  a  randomized  algorithm  that  delivers  the  submatrix  promised 
by  Kashin-Tzafriri.  In  Section  5,  we  traverse  a  similar  route  to  develop  an  algorithmic  version  of 
Bourgain-Tzafriri. Section 6 provides more details about the stable rank and describes directions
for  future  work.  Appendix  A  contains  some  key  estimates  on  the  norms  of  random  submatrices, 
and  Appendix  B  outlines  a  simple  computational  procedure  for  solving  the  eigenvalue  optimization 
problems  that  arise  in  our  work. 

2.  The  Kashin-Tzafriri  Theorem 

The  proof  of  the  Kashin-Tzafriri  theorem  proceeds  in  two  steps.  First,  we  select  a  random  set 
of  columns  with  appropriate  cardinality.  Second,  we  use  a  matrix  factorization  to  identify  and 
remove  redundant  columns  that  inflate  the  spectral  norm.  The  proof  gives  strong  hints  about  how 
a  computational  procedure  might  work,  even  though  it  is  not  constructive. 

2.1.  Intuitions.  We  would  like  to  think  that  a  random  submatrix  inherits  its  share  of  the  norm 
of  the  entire  matrix.  In  other  words,  if  we  were  to  select  a  tenth  of  the  columns,  we  might  hope  to 
reduce  the  norm  by  a  factor  of  ten.  Unfortunately,  this  intuition  is  meretricious. 

Indeed,  random  selection  does  not  necessarily  reduce  the  spectral  norm  at  all.  The  essential 
reason emerges when we consider the “double identity,” the $m \times 2m$ matrix $A = [I \mid I]$. Suppose
we draw $s$ random columns from $A$ without replacement. The probability that all $s$ columns are
distinct is

\[
\frac{2m-2}{2m-1} \times \frac{2m-4}{2m-2} \times \cdots \times \frac{2m-2(s-1)}{2m-(s-1)}
\le \prod_{j=0}^{s-1} \left(1 - \frac{j}{2m}\right)
\le \exp\left(-\sum_{j=0}^{s-1} \frac{j}{2m}\right)
\approx e^{-s^2/(4m)}.
\]

Therefore, when $s = \Omega(\sqrt{m})$, sampling almost always produces a submatrix with at least one
duplicated column. A duplicated column means that the norm of the submatrix is $\sqrt{2}$, which
equals the norm of the full matrix, so no reduction takes place.
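
The decay of this probability is easy to check numerically. The following is a minimal sketch that evaluates the exact product and the exponential bound for the double identity; the function names `prob_all_distinct` and `exponential_bound` are introduced here for illustration only.

```python
import math


def prob_all_distinct(m: int, s: int) -> float:
    """Exact probability that s columns drawn without replacement from
    the m x 2m double identity [I | I] are pairwise distinct."""
    if s > m:
        return 0.0  # only m distinct columns exist
    p = 1.0
    for j in range(s):
        # After j distinct columns are drawn, 2m - j columns remain,
        # and j of them duplicate a column that was already chosen.
        p *= (2 * m - 2 * j) / (2 * m - j)
    return p


def exponential_bound(m: int, s: int) -> float:
    """Upper bound exp(-sum_{j<s} j/(2m)) = exp(-s(s-1)/(4m))."""
    return math.exp(-s * (s - 1) / (4 * m))


m = 10_000
for s in (10, 100, 400, 1000):
    print(s, prob_all_distinct(m, s), exponential_bound(m, s))
```

With $m = 10{,}000$, the probability of drawing only distinct columns is already small once $s$ reaches a few multiples of $\sqrt{m} = 100$.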
Nevertheless, a randomly chosen set of columns from a standardized matrix typically contains
a  large  set  of  columns  that  has  small  norm.  We  will  see  that  the  desired  subset  is  exposed  by 
factoring  the  random  submatrix.  This  factorization,  which  was  invented  by  Pietsch,  is  regarded  as 
a  basic  instrument  in  modern  functional  analysis. 

2.2.  The $(\infty, 2)$ operator norm. Although sampling does not necessarily reduce the spectral
norm, it often reduces other matrix norms. Define the natural norm on linear operators from $\ell_\infty$
to $\ell_2$ via the expression

\[ \|B\|_{\infty\to 2} = \max\{ \|Bx\|_2 : \|x\|_\infty = 1 \}. \]

An immediate consequence is that $\|B\|_{\infty\to 2} \le \sqrt{s}\,\|B\|$ for each matrix $B$ with $s$ columns. Equality
can obtain in this bound.
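
For very small matrices the definition can be evaluated by brute force. The sketch below assumes a real matrix stored as a list of rows and enumerates all sign vectors, so its cost is exponential in the number of columns; the name `inf_to_2_norm` is hypothetical and chosen only for this illustration.

```python
import itertools
import math


def inf_to_2_norm(B):
    """Brute-force (inf, 2) norm of a real matrix B given as a list of rows.

    The map x -> ||Bx||_2 is convex, so its maximum over the cube
    ||x||_inf <= 1 is attained at a vertex, that is, at a sign vector.
    """
    s = len(B[0])
    best = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=s):
        Bx = [sum(row[j] * signs[j] for j in range(s)) for row in B]
        best = max(best, math.sqrt(sum(v * v for v in Bx)))
    return best


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(inf_to_2_norm(identity))  # 2.0
```

For the $4 \times 4$ identity the result equals $\sqrt{s}\,\|I\| = 2$, an instance in which equality obtains in the bound above.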

The exact calculation of the $(\infty, 2)$ operator norm is computationally difficult. Results of
Rohn [Roh00] imply that there is a class of positive semidefinite matrices for which it is NP-hard
to estimate $\|\cdot\|_{\infty\to 2}$ within an absolute tolerance. Nevertheless, we will see that the norm can
be approximated in polynomial time up to a small relative error. (See Remark 3.2.)

As we have intimated, the $(\infty, 2)$ norm can often be reduced by random selection. The following
theorem requires some heavy lifting, which we delegate to Appendix A.2.

Theorem 2.1. Suppose $A$ is a standardized matrix with $n$ columns. Choose

\[ s \le \lceil 2\,\mathrm{st.rank}(A) \rceil, \]

and draw a uniformly random subset $\sigma$ with cardinality $s$ from $\{1, 2, \dots, n\}$. Then

\[ \mathbb{E}\,\|A_\sigma\|_{\infty\to 2} \le 7\sqrt{s}. \]

In particular, $\|A_\sigma\|_{\infty\to 2} \le 8\sqrt{s}$ with probability at least 1/8.

2.3.  Pietsch factorization. We cannot exploit the bound in Theorem 2.1 unless we have a way
to connect the $(\infty, 2)$ norm with the spectral norm. To that end, let us recall one of the landmark
theorems of functional analysis.

Theorem 2.2 (Pietsch Factorization). Each matrix $B$ can be factored as $B = TD$ where

•  $D$ is a nonnegative, diagonal matrix with $\mathrm{trace}(D^2) = 1$, and

•  $\|B\|_{\infty\to 2} \le \|T\| \le K_P \|B\|_{\infty\to 2}$.

This result follows from the little Grothendieck theorem [Pis86, Sec. 5b] and the Pietsch factorization
theorem [Pis86, Cor. 1.8]. The standard proof produces the factorization using an abstract
separation argument that offers no algorithmic insight. The value of the constant $K_P$ is known.

•  When the scalar field is real, we have $K_P(\mathbb{R}) = \sqrt{\pi/2} \approx 1.25$.

•  When the scalar field is complex, we have $K_P(\mathbb{C}) = \sqrt{4/\pi} \approx 1.13$.

A  major  application  of  Pietsch  factorization  is  to  identify  a  submatrix  with  controlled  spectral 
norm.  The  following  proposition  describes  the  procedure. 

Proposition 2.3. Suppose $B$ is a matrix with $s$ columns. Then there is a set $\tau$ of column indices
for which

\[ |\tau| \ge \frac{s}{2} \quad\text{and}\quad \|B_\tau\| \le K_P \sqrt{\frac{2}{s}}\, \|B\|_{\infty\to 2}. \]

Proof. Consider a Pietsch factorization $B = TD$, and define

\[ \tau = \{ j : d_{jj}^2 \le 2/s \}. \]

Since $\sum_j d_{jj}^2 = 1$, Markov's inequality implies that $|\tau| \ge s/2$. We may calculate that

\[ \|B_\tau\| = \|T D_\tau\| \le \|T\| \cdot \|D_\tau\| \le K_P \|B\|_{\infty\to 2} \cdot \sqrt{2/s}. \]

This completes the proof.


□
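
Once the diagonal factor is known, the selection step in this proof is a simple threshold. Here is a minimal sketch, assuming the diagonal of $D$ is supplied as a list of nonnegative numbers whose squares sum to one; the helper name `select_columns` and the sample diagonal are hypothetical.

```python
import math


def select_columns(d):
    """Return tau = {j : d_j^2 <= 2/s} for the diagonal d of D.

    Assumes d is nonnegative with sum of squares equal to 1, as in a
    Pietsch factorization B = T D with s = len(d) columns.
    """
    s = len(d)
    return [j for j, dj in enumerate(d) if dj * dj <= 2.0 / s]


# Hypothetical diagonal: one heavy column and five light ones.
d = [math.sqrt(0.5)] + [math.sqrt(0.1)] * 5
tau = select_columns(d)
print(tau)  # [1, 2, 3, 4, 5]
```

The heavy column (index 0) is discarded because $d_{00}^2 = 0.5 > 2/6$, and the surviving set has at least $s/2 = 3$ elements, as Markov's inequality guarantees.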
2.4.  Proof of Kashin-Tzafriri. With these results at hand, we easily complete the proof of
the Kashin-Tzafriri theorem. Suppose $A$ is a standardized matrix with $n$ columns. Assume that
$\mathrm{st.rank}(A) < n/2$. Otherwise, the spectral norm $\|A\| \le \sqrt{2}$, so we may select $\tau = \{1, 2, \dots, n\}$.
According to Theorem 2.1, there is a subset $\sigma$ of column indices for which

\[ |\sigma| \ge 2\,\mathrm{st.rank}(A) \quad\text{and}\quad \|A_\sigma\|_{\infty\to 2} \le 8\sqrt{|\sigma|}. \]

Apply Proposition 2.3 to the matrix $B = A_\sigma$ to obtain a subset $\tau$ inside $\sigma$ for which

\[ |\tau| \ge \frac{|\sigma|}{2} \quad\text{and}\quad \|B_\tau\| \le K_P \sqrt{\frac{2}{|\sigma|}}\, \|B\|_{\infty\to 2}. \]

Since $B_\tau = A_\tau$ and $K_P \le \sqrt{\pi/2}$, these bounds reveal the advertised conclusion:

\[ |\tau| \ge \mathrm{st.rank}(A) \quad\text{and}\quad \|A_\tau\| \le 15. \]
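
To see where the constant 15 comes from, one can chain the bound on the $(\infty, 2)$ norm of the sampled submatrix with the bound from Proposition 2.3. The following worked check uses only the quantities already introduced.

```latex
\[
\|A_\tau\|
  \le K_P \sqrt{\frac{2}{|\sigma|}} \cdot 8\sqrt{|\sigma|}
  = 8\sqrt{2}\,K_P
  \le 8\sqrt{2}\,\sqrt{\pi/2}
  = 8\sqrt{\pi}
  \approx 14.18 < 15.
\]
```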


At this point, we take a step back and notice that this proof is nearly algorithmic. It is straightforward
to perform the random selection described in Theorem 2.1. Provided that we know a
Pietsch factorization of the matrix $B$, we can easily carry out the column selection of Proposition 2.3.
Therefore, we need only develop an algorithm for computing the Pietsch factorization to
reach an effective version of the Kashin-Tzafriri theorem.


3.  Pietsch  Factorization  via  Convex  Optimization 

The  main  novelty  is  to  demonstrate  that  we  can  produce  a  Pietsch  factorization  by  solving  a 
convex  programming  problem.  Remarkably,  the  resulting  optimization  is  the  dual  of  the  famous 
MAXCUT semidefinite program [GW95], for which many polynomial-time algorithms are available.

3.1.  Pietsch  and  eigenvalues.  The  next  theorem,  which  serves  as  the  basis  for  our  computational 
method,  demonstrates  that  Pietsch  factorizations  have  an  intimate  relationship  with  the  eigenvalues 
of a related matrix. In the sequel, we reserve the letter $D$ for a nonnegative, diagonal matrix with
$\mathrm{trace}(D^2) = 1$, and we write $\lambda_{\max}$ for the algebraically maximal eigenvalue of a Hermitian matrix.

Theorem 3.1. The factorization $B = TD$ satisfies $\|T\| \le \alpha$ if and only if $D$ satisfies

\[ \lambda_{\max}(B^*B - \alpha^2 D^2) \le 0. \]

In particular, if no $D$ verifies this bound, then no factorization $B = TD$ admits $\|T\| \le \alpha$.

Proof. Assume $B$ has a factorization $B = TD$ with $\|T\| \le \alpha$. We have the chain of implications

\[
\begin{aligned}
\|Bx\|_2^2 = \|TDx\|_2^2 \quad &\forall x \\
\Longrightarrow\quad \|Bx\|_2^2 \le \alpha^2 \|Dx\|_2^2 \quad &\forall x \\
\Longrightarrow\quad x^*B^*Bx \le \alpha^2 x^*D^2x \quad &\forall x \\
\Longrightarrow\quad x^*(B^*B - \alpha^2 D^2)x \le 0 \quad &\forall x \\
\Longrightarrow\quad B^*B - \alpha^2 D^2 \preccurlyeq 0, \quad &
\end{aligned}
\]

where $\preccurlyeq$ denotes the semidefinite, or Löwner, ordering on Hermitian matrices.

Conversely, assume we are provided the inequality

\[ B^*B - \alpha^2 D^2 \preccurlyeq 0. \tag{3.1} \]

First, we claim that any zero entry in $D$ corresponds with a zero column of $B$. To check this point,
suppose that $d_{jj} = 0$ for an index $j$. The relation (3.1) requires that

\[ 0 \ge (B^*B - \alpha^2 D^2)_{jj} = b_j^* b_j. \]

This inequality is impossible unless $b_j = 0$. To continue, set $T = BD^\dagger$ and observe that $B = TD$
because the zero entries of $D$ correspond with zero columns of $B$. Therefore, we may factor the
diagonal matrix out from (3.1) to reach

\[ D(T^*T - \alpha^2 P)D \preccurlyeq 0, \]

where the matrix $P = DD^\dagger$ is an orthogonal projector. Sylvester's theorem on inertia [HJ85,
Thm. 4.5.8] ensures that $T^*T - \alpha^2 P \preccurlyeq 0$. Since $P$ is a projector, this relation implies that

\[ T^*T \preccurlyeq \alpha^2 P \preccurlyeq \alpha^2 I. \]

We conclude that $\|T\| \le \alpha$. □
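
As a quick sanity check of Theorem 3.1, consider a matrix $B$ with $s$ orthonormal columns together with the uniform choice $D = s^{-1/2} I$. This specific example is an illustration added here; it is not part of the argument above.

```latex
\[
B^*B - \alpha^2 D^2 = \Bigl(1 - \frac{\alpha^2}{s}\Bigr) I
\quad\Longrightarrow\quad
\lambda_{\max}(B^*B - \alpha^2 D^2) \le 0 \iff \alpha \ge \sqrt{s},
\]
\[
T = BD^\dagger = \sqrt{s}\,B, \qquad \|T\| = \sqrt{s}.
\]
```

Both sides of the equivalence agree: for this $D$, the factor $T$ has norm at most $\alpha$ exactly when $\alpha \ge \sqrt{s}$.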

3.2.  Factorization via optimization. Recall that the maximum eigenvalue is a convex function
on the space of Hermitian matrices, so it can be minimized in polynomial time [LO96]. We are led
to consider the convex program

\[ \min\ \lambda_{\max}(B^*B - \alpha^2 F) \quad\text{subject to}\quad \mathrm{trace}(F) = 1,\ F \text{ diagonal, and } F \ge 0. \tag{3.2} \]

Owing to Theorem 3.1, there exists a factorization $B = TD$ with $\|T\| \le \alpha$ if and only if the value
of (3.2) is nonpositive.

Now, if $F$ is a feasible point of (3.2) with a nonpositive objective value, we can factorize

\[ B = TD \quad\text{with}\quad D = F^{1/2},\ T = BD^\dagger,\ \text{and}\ \|T\| \le \alpha. \]

In fact, it is not necessary to solve (3.2) to optimality. Suppose $B$ has $s$ columns, and assume we
have identified a feasible point $F$ with a (positive) objective value $\eta$. That is,

\[ \lambda_{\max}(B^*B - \alpha^2 F) \le \eta. \]

Rearranging this relation, we reach

\[ \lambda_{\max}\left[ B^*B - (\alpha^2 + \eta s)\tilde{F} \right] \le 0 \quad\text{where}\quad \tilde{F} = \frac{1}{\alpha^2 + \eta s}\,(\alpha^2 F + \eta I). \]

Since $\tilde{F}$ is positive and diagonal with $\mathrm{trace}(\tilde{F}) = 1$, we obtain the factorization

\[ B = TD \quad\text{with}\quad D = \tilde{F}^{1/2},\ T = BD^{-1},\ \text{and}\ \|T\| \le \sqrt{\alpha^2 + \eta s}. \]

To select a target value for the parameter $\alpha$, we look to the proof of the Kashin-Tzafriri theorem.
If $B$ has $s$ columns, then $\alpha = 8K_P\sqrt{s}$ is an appropriate choice. Furthermore, since the argument
only uses the bound $\|T\| = O(\sqrt{s})$, it suffices to solve (3.2) with precision $\eta = O(1)$.
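
The repair of an inexact solution is a one-line rescaling of the diagonal. The following sketch assumes the feasible point is given by its diagonal entries; the function name `shift_feasible_point` and the sample inputs are hypothetical and serve only to illustrate the formula for the shifted diagonal matrix.

```python
def shift_feasible_point(F, alpha, eta):
    """Turn a feasible diagonal F (entries >= 0, sum 1) whose objective
    value in (3.2) is at most eta > 0 into the shifted diagonal
    F_tilde = (alpha^2 F + eta I) / (alpha^2 + eta s).

    Returns F_tilde and the resulting bound sqrt(alpha^2 + eta s) on ||T||.
    """
    s = len(F)
    scale = alpha ** 2 + eta * s
    F_tilde = [(alpha ** 2 * f + eta) / scale for f in F]
    return F_tilde, scale ** 0.5


# Hypothetical inputs: s = 4 columns, alpha = 2, objective value eta = 1.
F_tilde, bound = shift_feasible_point([0.5, 0.5, 0.0, 0.0], alpha=2.0, eta=1.0)
print(F_tilde, bound)
```

Every entry of the shifted diagonal is strictly positive, so $D = \tilde{F}^{1/2}$ is invertible and no pseudoinverse is needed.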

3.3.  Other formulations. In a general setting, a target value for $\alpha$ is not likely to be available.
Let us exhibit an alternative formulation of (3.2) that avoids this inconvenience.

\[ \min\ \lambda_{\max}(B^*B - E) + \mathrm{trace}(E) \quad\text{subject to}\quad E \text{ diagonal},\ E \ge 0. \tag{3.3} \]

Suppose $\alpha_\star$ is the minimal value of $\|T\|$ achievable in any Pietsch factorization $B = TD$. It can be
shown that $\alpha_\star^2$ is the value of (3.3) and that each optimizer $E_\star$ satisfies $\mathrm{trace}(E_\star) = \alpha_\star^2$. As such,
we can construct an optimal Pietsch factorization from a minimizer:

\[ B = TD \quad\text{with}\quad D = (E_\star/\mathrm{trace}(E_\star))^{1/2},\ T = BD^\dagger,\ \text{and}\ \|T\| = \alpha_\star. \]

The dual of (3.3) is the semidefinite program

\[ \max\ \langle B^*B, Z \rangle \quad\text{subject to}\quad \mathrm{diag}(Z) = I \text{ and } Z \succcurlyeq 0. \tag{3.4} \]

This is the famous MAXCUT semidefinite program [GW95]. We find an unexpected connection
between Pietsch factorization and the problem of partitioning nodes of a graph.

Given a dual optimum, we can easily construct a primal optimum by means of the complementary
slackness condition [Ali95, Thm. 2.10]. Indeed, each feasible optimal pair $(E_\star, Z_\star)$ satisfies
$Z_\star(B^*B - E_\star) = 0$. Examining the diagonal elements of this matrix equation, we find that

\[ E_\star = \mathrm{diag}(E_\star) = \mathrm{diag}(Z_\star E_\star) = \mathrm{diag}(Z_\star B^*B) \]

owing to the constraint $\mathrm{diag}(Z_\star) = I$. Obtaining a dual optimum from a primal optimum, however,
requires more ingenuity.

Remark 3.2. According to Theorem 2.2 and the discussion here, the optimal value of (3.3) overestimates
$\|B\|_{\infty\to 2}^2$ by a multiplicative factor no greater than $K_P^2$. As a result, the optimization
problem (3.3) can be used to design an approximation algorithm for $(\infty, 2)$ norms.

3.4.  Algorithmic  aspects.  The  purpose  of  this  paper  is  not  to  rehash  methods  for  solving  a 
standard  optimization  problem,  so  we  keep  this  discussion  brief.  It  is  easy  to  see  that  (3.2)  can  be 
framed as a (nonsmooth) convex optimization over the probability simplex. Appendix B outlines
an  elegant  technique,  called  Entropic  Mirror  Descent  [BT03],  designed  specifically  for  this  class 
of  problems.  Although  the  EMD  algorithm  is  (theoretically)  not  the  most  efficient  approach  to 
(3.2),  preliminary  experiments  suggest  that  its  empirical  performance  rivals  more  sophisticated 
techniques. 

For  a  concrete  time  bound,  we  refer  to  Alizadeh’s  work  on  primal-dual  potential  reduction 
methods for semidefinite programming [Ali95]. When $B$ has dimension $m \times s$, the cost of forming
$B^*B$ is at most $O(s^2 m)$. Then the cost of solving (3.4) is no more than $\tilde{O}(s^{3.5})$, where the tilde
indicates that log-like factors are suppressed.


4.  An  Algorithm  for  Kashin-Tzafriri 


At  this  point,  we  have  amassed  the  materiel  necessary  to  deploy  an  algorithm  that  constructs  the 
set $\tau$ promised by the Kashin-Tzafriri theorem. The procedure appears on page 11 as Algorithm 1.
The following result describes its performance.


Theorem 4.1. Suppose $A$ is an $m \times n$ standardized matrix. With probability at least 4/5, Algorithm 1
produces a set $\tau = \tau_\star$ of column indices for which

\[ |\tau| \ge \frac{1}{2}\,\mathrm{st.rank}(A) \quad\text{and}\quad \|A_\tau\| \le 15. \]

The computational cost is bounded by $\tilde{O}(|\tau|^2 m + |\tau|^{3.5})$.

Remarkably, Algorithm 1 is sublinear in the size of the matrix when $\mathrm{st.rank}(A) = o(n^{1/3.5})$.
Better methods for solving (3.2) would strengthen this bound.


Proof. According to Section 2, the procedure Norm-Reduce has failure probability less than 7/8
when $s \le 2\,\mathrm{st.rank}(A)$. The probability the inner loop fails to produce an acceptable set $\tau_\star$ of size
$s/2$ is at most $(7/8)^{8\log_2 s}$. So the probability the algorithm fails before $s \ge \mathrm{st.rank}(A)$ is at most

\[ \sum_{j=2}^{\infty} (7/8)^{8j} = \frac{(7/8)^{16}}{1 - (7/8)^{8}} < 0.2. \]

With constant probability, we obtain a set $\tau_\star$ with cardinality at least $\mathrm{st.rank}(A)/2$.

The cost of the procedure Norm-Reduce is dominated by the cost of the Pietsch factorization,
which is $\tilde{O}(s^2 m + s^{3.5})$ for a fixed $s$. Summing over $s$ and $k$, we find that the total cost of all
the invocations of Norm-Reduce is dominated (up to logarithmic factors) by the cost of the final
invocation, during which the parameter $s \le 2\,|\tau_\star|$.

An estimate of the spectral norm of $A_\tau$ can be obtained as a by-product of solving (3.2). Indeed,
Proposition 2.3 and the discussion in Section 3.2 show that we can bound the spectral norm in
terms of the parameter $\alpha$ and the objective value obtained in (3.2). □
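
The failure-probability estimate in this proof can be confirmed numerically. This is a small sketch that compares the closed form of the geometric series with a long partial sum; the variable names are chosen here for illustration.

```python
# Ratio of the geometric series: one doubling of s multiplies the
# inner-loop failure bound (7/8)^(8 log2 s) by (7/8)^8.
ratio = (7 / 8) ** 8

# Closed form of sum_{j >= 2} (7/8)^(8j) = (7/8)^16 / (1 - (7/8)^8).
closed_form = ratio ** 2 / (1 - ratio)

# Long partial sum of the same series, starting at j = 2 (s = 4).
partial_sum = sum(ratio ** j for j in range(2, 200))

print(closed_form, partial_sum)
```

Both values are approximately 0.18, which is below the threshold 0.2 used in Theorem 4.1.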
5.  The  Bourgain-Tzafriri  Theorem


Our proof of the Bourgain-Tzafriri theorem is almost identical in structure with the proof of
the Kashin-Tzafriri theorem. This streamlined argument appears to be simpler than all previously
published approaches, but it contains no significant conceptual innovations. Our discussion
culminates in an algorithm remarkably similar to Algorithm 1.

5.1.  Preliminary results. Suppose $A$ is a standardized matrix with $n$ columns. We will work
instead with a related matrix $H = A^*A - I$, which is called the hollow Gram matrix. The advantage
of considering the hollow Gram matrix is that we can perform column selection on $A$ simply by
reducing the norm of $H$.

Proposition 5.1. Suppose $A$ is a standardized matrix with hollow Gram matrix $H$. If $\tau$ is a set
of column indices for which $\|H_{\tau\times\tau}\| \le 0.5$, then $\kappa(A_\tau) \le \sqrt{3}$.

Proof. The hypothesis $\|H_{\tau\times\tau}\| \le 0.5$ implies that the eigenvalues of $H_{\tau\times\tau}$ lie in the range $[-0.5, 0.5]$.
Since $H_{\tau\times\tau} = A_\tau^* A_\tau - I$, the eigenvalues of $A_\tau^* A_\tau$ fall in the interval $[0.5, 1.5]$. An equivalent
condition is that $0.5 \le \|A_\tau x\|_2^2 \le 1.5$ whenever $\|x\|_2 = 1$. We conclude that

\[ \kappa(A_\tau) = \max\left\{ \frac{\|A_\tau x\|_2}{\|A_\tau y\|_2} : \|x\|_2 = \|y\|_2 = 1 \right\} \le \sqrt{\frac{1.5}{0.5}} = \sqrt{3}. \]

Thus, a norm bound for $H_{\tau\times\tau}$ yields a condition number bound for $A_\tau$.


□
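
The constant $\sqrt{3}$ in Proposition 5.1 cannot be improved, which a two-column example makes visible. The sketch below assumes two real unit-norm columns with inner product `rho`; the closed-form expression and the function name `kappa_two_unit_columns` are introduced here for illustration.

```python
import math


def kappa_two_unit_columns(rho: float) -> float:
    """Condition number of a matrix with two unit-norm real columns
    whose inner product is rho, with |rho| < 1.

    The Gram matrix [[1, rho], [rho, 1]] has eigenvalues 1 +/- |rho|, so
    the hollow Gram matrix has spectral norm |rho| and the condition
    number equals sqrt((1 + |rho|) / (1 - |rho|)).
    """
    r = abs(rho)
    return math.sqrt((1 + r) / (1 - r))


print(kappa_two_unit_columns(0.5), math.sqrt(3))
```

When the hollow Gram matrix has norm exactly 0.5, the condition number equals $\sqrt{3}$, so the proposition is tight in this case.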


As  we  mentioned  before,  random  selection  may  reduce  other  norms  even  if  it  does  not  reduce 
the spectral norm. Define the natural norm on linear maps from $\ell_\infty$ to $\ell_1$ by the formula

\[ \|G\|_{\infty\to 1} = \max\{ \|Gx\|_1 : \|x\|_\infty = 1 \}. \]

This norm is closely related to the cut norm, which plays a starring role in graph theory [AN04].
For a general $s \times s$ matrix $G$, the best inequality between the $(\infty, 1)$ norm and the spectral norm
is $\|G\|_{\infty\to 1} \le s\,\|G\|$. Rohn [Roh00] has established that there is a class of positive semidefinite,
integer matrices for which it is NP-hard to determine the $(\infty, 1)$ norm within an absolute tolerance of
1/2. Nevertheless, it can be approximated within a small relative factor in polynomial time [AN04].

The $(\infty, 1)$ norm decreases when we randomly sample a principal submatrix. The following
result, which we establish in Appendix A.4, is a direct consequence of Rudelson and Vershynin's
work on the cut norm of random submatrices [RV07, Thm. 1.5].

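Theorem 5.2. Suppose $A$ is an $n$-column standardized matrix with hollow Gram matrix $H$. Choose

\[ s \le \lceil c \cdot \mathrm{st.rank}(A) \rceil, \]

and draw a uniformly random subset $\sigma$ with cardinality $s$ from $\{1, 2, \dots, n\}$. Then

\[ \mathbb{E}\,\|H_{\sigma\times\sigma}\|_{\infty\to 1} \le \frac{s}{9}. \]

In particular, $\|H_{\sigma\times\sigma}\|_{\infty\to 1} \le s/8$ with probability at least 1/9.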

To connect the $(\infty, 1)$ norm with the spectral norm, we call on the celebrated factorization of
Grothendieck [Pis86, p. 56].

Theorem 5.3 (Grothendieck Factorization). Each matrix $G$ can be factored as $G = D_1 T D_2$ where

(1) $D_i$ is a nonnegative, diagonal matrix with $\mathrm{trace}(D_i^2) = 1$ for $i = 1, 2$, and

(2) $\|G\|_{\infty\to 1} \le \|T\| \le K_G \|G\|_{\infty\to 1}$.

When $G$ is Hermitian, we may take $D_1 = D_2$.

The precise value of the Grothendieck constant $K_G$ remains an outstanding open question, but
it is known to depend on the scalar field [Pis86, Sec. 5e].

•  When the scalar field is real, $1.570 \le \pi/2 \le K_G(\mathbb{R}) \le \pi/(2\log(1+\sqrt{2})) \le 1.783$.

•  When the scalar field is complex, $1.338 \le K_G(\mathbb{C}) \le 1.405$.

For positive semidefinite $G$, the real (resp., complex) Grothendieck constant equals the square of
the real (resp., complex) Pietsch constant because $\|B^*B\|_{\infty\to 1} = \|B\|_{\infty\to 2}^2$.

The  following  proposition  describes  the  role  of  the  Grothendieck  factorization  in  the  selection  of 
submatrices  with  controlled  spectral  norm. 


Proposition 5.4. Suppose $G$ is an $s \times s$ Hermitian matrix. There is a set $\tau$ of column indices for
which

\[ |\tau| \ge \frac{s}{2} \quad\text{and}\quad \|G_{\tau\times\tau}\| \le \frac{2K_G}{s}\, \|G\|_{\infty\to 1}. \]

Proof. Consider a Grothendieck factorization $G = DTD$, and identify $\tau = \{ j : d_{jj}^2 \le 2/s \}$. The
remaining details echo the proof of Proposition 2.3. □

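5.2.  Proof of Bourgain-Tzafriri. Suppose $A$ is a standardized matrix with $n$ columns, and
consider its hollow Gram matrix $H$. Theorem 5.2 provides a set $\sigma$ with cardinality $s = |\sigma|$ for which

\[ |\sigma| \ge c \cdot \mathrm{st.rank}(A) \quad\text{and}\quad \|H_{\sigma\times\sigma}\|_{\infty\to 1} \le \frac{s}{8}. \]

Apply Proposition 5.4 to the $s \times s$ matrix $G = H_{\sigma\times\sigma}$ to obtain a further subset $\tau$ inside $\sigma$ with

\[ |\tau| \ge \frac{s}{2} \quad\text{and}\quad \|G_{\tau\times\tau}\| \le \frac{2K_G}{s}\, \|G\|_{\infty\to 1}. \]

Since $2K_G < 4$ and $H_{\tau\times\tau} = G_{\tau\times\tau}$, we determine that

\[ |\tau| \ge \frac{c}{2} \cdot \mathrm{st.rank}(A) \quad\text{and}\quad \|H_{\tau\times\tau}\| \le 0.5. \]

In view of Proposition 5.1, we conclude $\kappa(A_\tau) \le \sqrt{3}$.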
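
The norm bound on the hollow Gram submatrix follows by chaining the estimate from Theorem 5.2 with the one from Proposition 5.4. A short worked check, using only the fact that $2K_G < 4$:

```latex
\[
\|H_{\tau\times\tau}\| = \|G_{\tau\times\tau}\|
  \le \frac{2K_G}{s}\,\|G\|_{\infty\to 1}
  \le \frac{2K_G}{s}\cdot\frac{s}{8}
  = \frac{K_G}{4}
  < \frac{1}{2}.
\]
```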

Now, take another step back and notice that this argument is nearly algorithmic. The
random selection of $\sigma$ can easily be implemented in practice, even though the proof does not
specify the value of $c$. Given a Grothendieck factorization $G = DTD$, it is straightforward to
identify the subset $\tau$. The challenge, as before, is to produce the factorization.


5.3.  Grothendieck  factorization  via  convex  optimization.  As  with  the  Pietsch  factorization, 
the  Grothendieck  factorization  can  be  identified  from  the  solution  to  a  convex  program. 

Theorem 5.5. Suppose $G$ is Hermitian. The factorization $G = DTD$ satisfies $\|T\| \le \alpha$ if and
only if $D$ satisfies

\[ \lambda_{\max} \begin{bmatrix} -\alpha D^2 & G \\ G & -\alpha D^2 \end{bmatrix} \le 0. \tag{5.1} \]

In particular, if no $D$ verifies this bound, then no factorization $G = DTD$ admits $\|T\| \le \alpha$.

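Proof. To check the forward implication, we essentially repeat the argument we used in Theorem 3.1
for the Pietsch case. This reasoning yields the pair of relations

\[ G - \alpha D^2 \preccurlyeq 0 \quad\text{and}\quad -G - \alpha D^2 \preccurlyeq 0. \]

Together, these two relations are equivalent with (5.1) because

\[
\begin{bmatrix} -\alpha D^2 & G \\ G & -\alpha D^2 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix} I & I \\ I & -I \end{bmatrix}^*
\begin{bmatrix} G - \alpha D^2 & 0 \\ 0 & -G - \alpha D^2 \end{bmatrix}
\begin{bmatrix} I & I \\ I & -I \end{bmatrix}.
\]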

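To prove the reverse implication, we assume that (5.1) holds. First, we must check that $d_{jj} = 0$
implies that $g_j = 0$. To verify this claim, observe that

\[
0 \ge
\begin{bmatrix} \alpha e_j \\ g_j \end{bmatrix}^*
\begin{bmatrix} -\alpha D^2 & G \\ G & -\alpha D^2 \end{bmatrix}
\begin{bmatrix} \alpha e_j \\ g_j \end{bmatrix}
= \alpha \left( 2\,\|g_j\|_2^2 - g_j^* D^2 g_j \right)
\ge \alpha\,\|g_j\|_2^2
\]

because $\mathrm{trace}(D^2) = 1$, where $e_j$ denotes the $j$th standard basis vector. Therefore, we may construct a Grothendieck factorization $G = DTD$
with $\|T\| \le \alpha$ by setting $T = D^\dagger G D^\dagger$. □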


This discussion leads us to frame the eigenvalue minimization problem

\[ \min\ \lambda_{\max} \begin{bmatrix} -\alpha F & G \\ G & -\alpha F \end{bmatrix} \quad\text{subject to}\quad \mathrm{trace}(F) = 1,\ F \text{ diagonal},\ F \ge 0. \tag{5.2} \]

Owing to Theorem 5.5, there is a factorization $G = DTD$ with $\|T\| \le \alpha$ if and only if the value
of (5.2) is nonpositive.

As in Section 3.2, we can easily construct Grothendieck factorizations from (imprecise) solutions
to the problem (5.2). The proof of Bourgain-Tzafriri suggests that an appropriate value for the
parameter is $\alpha = s/4$. Furthermore, we do not need to solve (5.2) to optimality to obtain the required
information. Indeed, it suffices to produce a feasible point with an objective value of $O(1)$.

To solve (5.2) in practice, we again propose the Entropic Mirror Descent algorithm [BT03].
Appendix B describes the application to this problem. To provide a concrete bound on the computational
cost, we remark that, when $A_\sigma$ has dimension $m \times s$, forming $G = A_\sigma^* A_\sigma - I$ costs at
most $O(s^2 m)$, and Alizadeh's interior-point method [Ali95] requires $\tilde{O}(s^{3.5})$ time.


Remark 5.6. For symmetric $G$, Theorem 5.3 shows that the norm $\|G\|_{\infty\to 1}$ is approximated
within a factor $K_G$ by the least $\alpha$ for which (5.2) has a nonpositive value. A natural reformulation
of (5.2) can identify this value of $\alpha$ automatically (cf. Section 3.3). For nonsymmetric $G$, similar
optimization problems arise. These ideas yield new approximation algorithms for the $(\infty, 1)$ norm.


5.4.  An algorithm for Bourgain-Tzafriri. We are prepared to state our algorithm for producing
the set $\tau$ described by the Bourgain-Tzafriri theorem. The procedure appears as Algorithm 2
on page 11. Note the striking similarity with Algorithm 1. The following result describes the
performance of the algorithm. We omit the proof, which parallels that of Theorem 4.1.


Theorem 5.7. Suppose $A$ is an $m \times n$ standardized matrix. With probability at least 3/4, Algorithm 2
produces a set $\tau = \tau_\star$ of column indices for which

\[ |\tau| \ge c \cdot \mathrm{st.rank}(A) \quad\text{and}\quad \kappa(A_\tau) \le \sqrt{3}. \]

The computational cost is bounded by $\tilde{O}(|\tau|^2 m + |\tau|^{3.5})$.


6.  Future  Directions 

After the initial work [BT87], additional research has clarified the role of the stable rank. We
highlight a positive result of Vershynin [Ver01, Cor. 7.1] and a negative result of Szarek [Sza90,
Thm. 1.2] which together imply that the stable rank describes precisely how large a well-conditioned
column submatrix can in general exist. See [Ver01, Sec. 5] for a more detailed discussion.

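Theorem 6.1 (Vershynin 2001). Fix $\varepsilon > 0$. For each matrix A, there is a set $\tau$ of column indices
for which

    \[ |\tau| \ge (1 - \varepsilon) \cdot \mathrm{st.rank}(A) \quad\text{and}\quad \kappa(A_\tau) \le C(\varepsilon). \]

Theorem 6.2 (Szarek). There is a sequence $\{A(n)\}$ of matrices of increasing dimension for which

    \[ |\tau| = \mathrm{st.rank}(A) \quad\Longrightarrow\quad \kappa(A_\tau) = \omega(1). \]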

Vershynin's proof constructs the set $\tau$ in Theorem 6.1 with a complicated iteration that
interleaves the Kashin-Tzafriri theorem and the Bourgain-Tzafriri theorem. We believe that the
argument  can  be  simplified  substantially  and  developed  into  a  column  selection  algorithm.  This 
achievement  might  lead  to  a  new  method  for  performing  rank-revealing  factorizations,  which  could 
have  a  significant  impact  on  the  practice  of  numerical  linear  algebra. 

Algorithm 1: Constructive version of Kashin-Tzafriri Theorem

KT(A)

Input: Standardized matrix A with n columns
Output: A subset $\tau_\star$ of {1, 2, ..., n}
Description: Produces $\tau_\star$ such that $|\tau_\star| \ge \mathrm{st.rank}(A)/2$ and $\|A_{\tau_\star}\| < 15$ w.p. 4/5

1  $\tau_\star = \{1\}$
2  for s = 4, 8, 16, ..., n
3      for k = 1, 2, 3, ..., $8 \log_2 s$
4          $\tau$ = Norm-Reduce(A, s)
5          if $\|A_\tau\| < 15$ then $\tau_\star = \tau$ and break
6      if $|\tau_\star| < s$ then exit


Norm-Reduce(A, s)

Input: Standardized matrix A with n columns, a parameter s
Output: A subset $\tau$ of {1, 2, ..., n}

1  Draw a uniformly random set $\sigma$ with cardinality s from {1, 2, ..., n}
2  Solve (3.2) with $B = A_\sigma$ and $\alpha = 8 K_P \sqrt{s}$ to obtain a factorization B = TD
3  Return $\tau = \{ j \in \sigma : d_{jj}^2 < 2/s \}$


Algorithm 2: Constructive version of Bourgain-Tzafriri Theorem

BT(A)

Input: Standardized matrix A with n columns
Output: A subset $\tau_\star$ of {1, 2, ..., n}
Description: Produces $\tau_\star$ such that $|\tau_\star| \ge \mathrm{st.rank}(A)/2$ and $\kappa(A_{\tau_\star}) < \sqrt{3}$ w.p. 3/4

1  $\tau_\star = \{1\}$
2  for s = 4, 8, 16, ..., n
3      for k = 1, 2, 3, ..., $8 \log_2 s$
4          $\tau$ = Cond-Reduce(A, s)
5          if $\kappa(A_\tau) < \sqrt{3}$ then $\tau_\star = \tau$ and break
6      if $|\tau_\star| < s$ then exit


Cond-Reduce(A, s)

Input: Standardized matrix A with n columns, a parameter s
Output: A subset $\tau$ of {1, 2, ..., n}

1  Draw a uniformly random set $\sigma$ with cardinality s from {1, 2, ..., n}
2  Solve (5.2) with $G = A_\sigma^* A_\sigma - I$ and $\alpha = s/4$ to obtain a factorization G = DTD
3  Return $\tau = \{ j \in \sigma : d_{jj}^2 < 2/s \}$
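
The final step of Cond-Reduce and the acceptance test in step 5 of BT(A) can be sketched as follows. This is an illustration only: it assumes the factorization is represented by a weight vector `f` with $f_j = d_{jj}^2$ summing to one (as in the simplex formulation of Appendix B), and the helper names `select_columns` and `passes_test` are hypothetical.

```python
import numpy as np

def select_columns(sigma, f):
    """Keep the indices j in sigma whose weight f_j = d_jj^2 is below 2/s."""
    sigma = np.asarray(sigma)
    f = np.asarray(f, dtype=float)
    s = len(sigma)
    return sigma[f < 2.0 / s]

def passes_test(A, tau, threshold=np.sqrt(3)):
    """Acceptance test: the condition number of A restricted to tau is below the threshold."""
    sv = np.linalg.svd(A[:, tau], compute_uv=False)
    return bool(sv[-1] > 0 and sv[0] / sv[-1] < threshold)

sigma = np.array([2, 5, 7, 11])
f = np.array([0.1, 0.6, 0.1, 0.2])     # hypothetical weights on the simplex
tau = select_columns(sigma, f)         # 2/s = 0.5, so index 5 is dropped
print(tau, passes_test(np.eye(12), tau))
```

Because the weights sum to one, at most s/2 of them can reach 2/s, so the selection always retains at least half of the sampled indices.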

Appendix A. Random Reduction of Norms

How  does  the  norm  of  a  matrix  change  when  we  pass  to  a  random  submatrix?  This  question 
has  great  importance  in  modern  functional  analysis,  but  it  also  has  implications  for  the  design  of 
algorithms. This appendix describes some general results on how random selection reduces the
$(\infty, 2)$ norm and the $(\infty, 1)$ norm. We also specialize these results to the structured matrices that
appear in the proofs of Theorem 1.1 and Theorem 1.2.

A.1. Random Coordinate Models. We begin with two standard models for selecting random
submatrices,  and  we  describe  how  these  models  are  related  for  an  important  class  of  matrix  norms. 

A matrix norm is monotonic if the norm of a matrix exceeds the norm of every (rectangular)
submatrix. More precisely, the norm $|||\cdot|||$ is monotonic if

    \[ |||P A P'||| \le |||A||| \]

for each matrix A and each pair P, P' of diagonal (i.e., coordinate) projectors. The basic example
of a monotonic matrix norm is the natural norm on operators from $\ell_p$ to $\ell_q$ with p, q in $[1, \infty]$,
which is defined as

    \[ \|A\|_{p \to q} = \max\{ \|Ax\|_q : \|x\|_p = 1 \}. \]

Fix a number $\delta$ in [0, 1], and denote by $P_\delta$ a random $n \times n$ diagonal matrix where exactly
$s = \lfloor \delta n \rfloor$ entries equal one and the rest equal zero. This matrix can be viewed as a projector onto a
random set of s coordinates. Therefore, we may treat $A P_\delta$ as a random s-column submatrix of A
by ignoring the zeroed columns. Although this model is conceptually appealing, it can be difficult
to analyze because of the dependencies among coordinates.

Let us introduce a simpler model for selecting random coordinates. We denote by $R_\delta$ a random
$n \times n$ diagonal matrix whose entries are independent 0-1 random variables with common mean $\delta$.
This matrix is a projector onto a random set of coordinates with average cardinality $\delta n$.
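
The two sampling models can be simulated directly. The sketch below is only an illustration; the helper names `sample_P`, `sample_R`, and `column_submatrix` are hypothetical, and the matrix sizes are arbitrary.

```python
import numpy as np

def sample_P(n, delta, rng):
    """Diagonal 0-1 projector with exactly floor(delta * n) ones (fixed-cardinality model)."""
    s = int(np.floor(delta * n))
    diag = np.zeros(n)
    diag[rng.choice(n, size=s, replace=False)] = 1.0
    return np.diag(diag)

def sample_R(n, delta, rng):
    """Diagonal projector whose entries are independent Bernoulli(delta) variables."""
    return np.diag((rng.random(n) < delta).astype(float))

def column_submatrix(A, proj):
    """Drop the columns of A that the diagonal projector zeroes out."""
    keep = np.flatnonzero(np.diag(proj))
    return A[:, keep]

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))
P = sample_P(20, 0.25, rng)            # exactly 5 columns survive
R = sample_R(20, 0.25, rng)            # 5 columns survive on average
A_P = column_submatrix(A, P)
print(A_P.shape, int(np.trace(R)))
```

The first model fixes the number of selected columns, while the second only fixes its mean, which is what makes its coordinates independent and easier to analyze.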

There  is  a  basic  result  connecting  these  two  models.  The  statement  here  follows  directly  from 
the  argument  in  [Tro08,  Lem.  14]. 

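Proposition A.1 (Poissonization). Let $|||\cdot|||$ be a monotonic matrix norm. For each matrix A with
n columns, it holds that

    \[ \mathbb{E}\,|||A P_\delta||| \le 2\,\mathbb{E}\,|||A R_\delta|||. \]

For each $n \times n$ matrix H, it holds that

    \[ \mathbb{E}\,|||P_\delta H P_\delta||| \le 2\,\mathbb{E}\,|||R_\delta H R_\delta|||. \]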

A.2. Reduction of the $(\infty, 2)$ norm. We begin with a general result on the $(\infty, 2)$ norm of a
uniformly random set of columns drawn from a fixed matrix. The basic argument appears already
in the work of Bourgain and Tzafriri [BT91, Thm. 1.1], but modern proofs are a little simpler.
(See [Ver06, Lem. 2.3], for example.) The version here offers especially good constants.

Theorem A.2. Fix $\delta \in [0, 1]$, and suppose A is a matrix with n columns. Then

    \[ \mathbb{E}\,\|A R_\delta\|_{\infty\to 2} \le \sqrt{2\delta(1 - \delta)}\,\|A\|_F + \delta\,\|A\|_{\infty\to 2}. \]

We  postpone  the  argument  to  the  next  section  so  we  may  note  a  corollary  that  appears  as  a  key 
step  in  the  proof  of  the  Kashin-Tzafriri  theorem. 

Corollary A.3. Suppose A is a standardized matrix with n columns. Choose $s \le \lceil 2\,\mathrm{st.rank}(A) \rceil$,
and write $\delta = s/n$. Then

    \[ \mathbb{E}\,\|A P_\delta\|_{\infty\to 2} \le 7\sqrt{s}. \]

Proof. Owing to the standardization, $1 \le \mathrm{st.rank}(A) = n / \|A\|^2$. It follows that

    \[ \delta \le \frac{2\,\mathrm{st.rank}(A) + 1}{n} \le \frac{3\,\mathrm{st.rank}(A)}{n} = \frac{3}{\|A\|^2}. \]

Apply the Poissonization result, Proposition A.1, to see that

    \[ \mathbb{E}\,\|A P_\delta\|_{\infty\to 2} \le 2\,\mathbb{E}\,\|A R_\delta\|_{\infty\to 2}. \]

Theorem A.2 yields

    \[ \mathbb{E}\,\|A P_\delta\|_{\infty\to 2} \le 2\sqrt{2\delta}\,\|A\|_F + 2\delta\,\|A\|_{\infty\to 2}. \]

Since A has n unit-norm columns, it holds that $\|A\|_F = \sqrt{n}$. We also have the general bound
$\|A\|_{\infty\to 2} \le \sqrt{n}\,\|A\|$. Therefore,

    \[ \mathbb{E}\,\|A P_\delta\|_{\infty\to 2} \le 2\sqrt{2\delta n} + 2\delta\sqrt{n}\,\|A\| = 2\sqrt{s}\,\bigl[\sqrt{2} + \sqrt{\delta}\,\|A\|\bigr]. \]

Introduce the bound on $\delta$ and make a numerical estimate to complete the proof. □
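
For completeness, the numerical estimate can be spelled out as a short worked check. It uses only the bound $\delta \le 3/\|A\|^2$ obtained at the start of the proof; the decimal value is rounded.

```latex
\sqrt{\delta}\,\|A\| \le \sqrt{3}
\quad\Longrightarrow\quad
\mathbb{E}\,\|A P_\delta\|_{\infty\to 2}
\le 2\sqrt{s}\,\bigl[\sqrt{2} + \sqrt{3}\bigr]
\approx 6.29\,\sqrt{s}
\le 7\sqrt{s}.
```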

A.3. Proof of Theorem A.2. We must bound the quantity

    \[ E = \mathbb{E}\,\|A R_\delta\|_{\infty\to 2}. \]

It turns out that it is easier to work with the (2, 1) norm, which is dual to the $(\infty, 2)$ norm, because
there are some special methods that apply. Rewrite the expression as

    \[ E = \mathbb{E}\,\|R_\delta A^*\|_{2\to 1} = \mathbb{E} \max_{\|x\|_2 = 1} \sum_{j=1}^{n} \delta_j\,|\langle a_j, x\rangle| \]

where $\{\delta_j\}$ is a sequence of independent 0-1 random variables with common mean $\delta$. In the sequel,
we simplify notation by omitting the restriction on the vector x and the limits from the sum.

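The next step is to center and symmetrize the selectors. First, add and subtract the mean of
each term from the sum and use the subadditivity of the maximum to obtain

    \begin{align*}
    E &\le \mathbb{E} \max_x \sum_j (\delta_j - \delta)\,|\langle a_j, x\rangle| + \max_x \sum_j \delta\,|\langle a_j, x\rangle| \\
      &= \mathbb{E} \max_x \sum_j (\delta_j - \delta)\,|\langle a_j, x\rangle| + \delta\,\|A^*\|_{2\to 1} \\
      &= \mathbb{E} \max_x \sum_j (\delta_j - \delta)\,|\langle a_j, x\rangle| + \delta\,\|A\|_{\infty\to 2}.
    \end{align*}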

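We focus on the first term, which we abbreviate by the letter F. Let $\{\delta'_j\}$ be an independent copy
of the sequence $\{\delta_j\}$. Jensen's inequality allows that

    \begin{align*}
    F &= \mathbb{E} \max_x \sum_j (\delta_j - \mathbb{E}'\delta'_j)\,|\langle a_j, x\rangle| \\
      &\le \mathbb{E} \max_x \sum_j (\delta_j - \delta'_j)\,|\langle a_j, x\rangle|.
    \end{align*}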

Observe that $\{\delta_j - \delta'_j\}$ is a sequence of independent, symmetric random variables. Thus, we may
multiply each one by a random sign without changing the expectation [LT91, Lem. 6.3]. That is,

    \[ F \le \mathbb{E} \max_x \sum_j \varepsilon_j (\delta_j - \delta'_j)\,|\langle a_j, x\rangle| \]

where $\{\varepsilon_j\}$ is a sequence of independent Rademacher (i.e., uniform ±1) random variables.

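Now, we invoke a specific type of Rademacher comparison [LT91, Thm. 4.12 et seq.] to remove
the absolute values from the inner product:

    \[ F \le \mathbb{E} \max_x \sum_j \varepsilon_j (\delta_j - \delta'_j)\,\langle a_j, x\rangle = \mathbb{E} \max_x \Bigl\langle \sum_j \varepsilon_j (\delta_j - \delta'_j)\,a_j,\ x \Bigr\rangle. \]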

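Since x ranges over the $\ell_2$ unit sphere, we reach

    \[ F \le \mathbb{E}\,\Bigl\| \sum_j \varepsilon_j (\delta_j - \delta'_j)\,a_j \Bigr\|_2. \]

The remaining expectations are elementary. First, apply Hölder's inequality to obtain

    \[ F \le \Bigl( \mathbb{E}\,\Bigl\| \sum_j \varepsilon_j (\delta_j - \delta'_j)\,a_j \Bigr\|_2^2 \Bigr)^{1/2}. \]

Compute the expectation with respect to $\{\varepsilon_j\}$ and then with respect to $\{\delta_j\}$ and $\{\delta'_j\}$:

    \[ F \le \Bigl( \mathbb{E} \sum_j (\delta_j - \delta'_j)^2\,\|a_j\|_2^2 \Bigr)^{1/2} = \Bigl( \sum_j 2\delta(1 - \delta)\,\|a_j\|_2^2 \Bigr)^{1/2} = \sqrt{2\delta(1 - \delta)}\,\|A\|_F. \]

Introduce this bound on F into the bound on E to conclude that

    \[ \mathbb{E}\,\|A R_\delta\|_{\infty\to 2} \le \sqrt{2\delta(1 - \delta)}\,\|A\|_F + \delta\,\|A\|_{\infty\to 2}. \]

This is the advertised estimate.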
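
The two elementary expectations used at the end of the proof can be unpacked as follows. This is a worked restatement, writing $c_j = \delta_j - \delta'_j$ for brevity (a shorthand introduced here only for illustration).

```latex
% Expectation over the Rademacher signs: cross terms vanish because
% E[eps_j eps_k] = 0 for j != k and eps_j^2 = 1.
\mathbb{E}_{\varepsilon} \Bigl\| \sum_j \varepsilon_j c_j a_j \Bigr\|_2^2
  = \sum_{j,k} \mathbb{E}[\varepsilon_j \varepsilon_k]\, c_j c_k\, \langle a_j, a_k \rangle
  = \sum_j c_j^2\, \|a_j\|_2^2 .

% Expectation over the selectors: delta_j and delta'_j are independent
% Bernoulli(delta) variables, each with variance delta (1 - delta).
\mathbb{E}\,(\delta_j - \delta'_j)^2
  = \operatorname{Var}(\delta_j) + \operatorname{Var}(\delta'_j)
  = 2\delta(1 - \delta).
```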

A.4. Reduction of the $(\infty, 1)$ norm. The impact of random selection on the $(\infty, 1)$ norm has
already received some attention in the theoretical computer science literature because of a
connection with graph cuts. The following result of Rudelson and Vershynin contains detailed information
on the $(\infty, 1)$ norm of a random principal submatrix. The statement involves an auxiliary norm

    \[ \|H\|_{\mathrm{col}} = \sum_j \|H e_j\|_2 \]

where $\{e_j\}$ is the set of standard basis vectors. In words, we sum the Euclidean norms of the
columns of the matrix.

Theorem A.4 (Rudelson-Vershynin). Fix $\delta \in [0, 1]$, and suppose H is an $n \times n$ matrix. Then

    \[ \mathbb{E}\,\|R_\delta H R_\delta\|_{\infty\to 1} \le C \bigl[ \delta^2\,\|H - \mathrm{diag}(H)\|_{\infty\to 1} + \delta^{3/2} \bigl( \|H\|_{\mathrm{col}} + \|H^*\|_{\mathrm{col}} \bigr) + \delta\,\|\mathrm{diag}(H)\|_{\infty\to 1} \bigr]. \]

Theorem A.4 is established with the same methods as Theorem A.2, along with an additional
decoupling argument [BT91, Prop. 1.9]. We rely on the following corollary in our proof of the
Bourgain-Tzafriri theorem.

Corollary A.5. Suppose A is an n-column standardized matrix with hollow Gram matrix
$H = A^*A - I$. Choose $s \le \lceil c \cdot \mathrm{st.rank}(A) \rceil$, and write $\delta = s/n$. Then

    \[ \mathbb{E}\,\|P_\delta H P_\delta\|_{\infty\to 1} \le \frac{s}{9}. \]

Proof. Suppose A is a standardized matrix with n columns, and define its $n \times n$ hollow Gram
matrix H. Observe that the $(\infty, 1)$ norm of H satisfies the bound

    \[ \|H\|_{\infty\to 1} \le n\,\|H\| \le n \max\{ \|A^*A\| - 1,\ 1 \} \le n\,\|A\|^2. \]

Meanwhile, the $\|\cdot\|_{\mathrm{col}}$ norm satisfies

    \[ \|H\|_{\mathrm{col}} \le \|A^*A\|_{\mathrm{col}} = \sum_{j=1}^{n} \|A^* a_j\|_2 \le n\,\|A\|. \]

These facts play a central role in the calculation.
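
Both norm bounds can be checked numerically on a small example. The sketch below is an illustration under two assumptions that are not part of the paper: the matrix is real, and n is small enough that the $(\infty, 1)$ norm can be found by enumerating all sign vectors (the maximum of a convex function over the cube is attained at a vertex). The helper names are hypothetical.

```python
import itertools
import numpy as np

def norm_inf_to_1(H):
    """Brute-force (inf,1) norm: maximize ||Hx||_1 over sign vectors x (real H, small n)."""
    n = H.shape[1]
    return max(np.abs(H @ np.array(x)).sum()
               for x in itertools.product((-1.0, 1.0), repeat=n))

def col_norm(H):
    """Sum of the Euclidean norms of the columns of H."""
    return np.linalg.norm(H, axis=0).sum()

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 8))
A /= np.linalg.norm(A, axis=0)          # standardize: unit-norm columns
n = A.shape[1]
H = A.T @ A - np.eye(n)                 # hollow Gram matrix
spec = np.linalg.norm(A, 2)             # spectral norm of A

print(norm_inf_to_1(H), n * spec ** 2)
print(col_norm(H), n * spec)
```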

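To continue, invoke the Poissonization result, Proposition A.1, which yields

    \[ \mathbb{E}\,\|P_\delta H P_\delta\|_{\infty\to 1} \le 2\,\mathbb{E}\,\|R_\delta H R_\delta\|_{\infty\to 1}. \]

Theorem A.4 provides that

    \[ \mathbb{E}\,\|P_\delta H P_\delta\|_{\infty\to 1} \le C \bigl[ \delta^2\,\|H\|_{\infty\to 1} + \delta^{3/2}\,\|H\|_{\mathrm{col}} \bigr] \]

where we have applied the facts that H is Hermitian and has a zero diagonal. The two norm
bounds result in additional simplifications:

    \[ \mathbb{E}\,\|P_\delta H P_\delta\|_{\infty\to 1} \le C \bigl[ \delta^2 n\,\|A\|^2 + \delta^{3/2} n\,\|A\| \bigr] = C s \bigl[ \delta\,\|A\|^2 + \delta^{1/2}\,\|A\| \bigr]. \]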

Since A has unit-norm columns, $\mathrm{st.rank}(A) = n / \|A\|^2$. As a result, $\delta = s/n \le c / \|A\|^2$. By fixing
a sufficiently small constant c, we can ensure that

    \[ \mathbb{E}\,\|P_\delta H P_\delta\|_{\infty\to 1} \le \frac{s}{9}, \]

the advertised bound. □
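
The role of the constant c in the last step can be made explicit. The following worked step only substitutes the bound $\delta \le c/\|A\|^2$ from the proof into the preceding estimate; C denotes the same unspecified constant as above.

```latex
\delta\,\|A\|^2 \le c
\quad\text{and}\quad
\delta^{1/2}\,\|A\| \le \sqrt{c}
\quad\Longrightarrow\quad
\mathbb{E}\,\|P_\delta H P_\delta\|_{\infty\to 1}
\le C s\,\bigl[\, c + \sqrt{c} \,\bigr].
```

Since $c + \sqrt{c} \to 0$ as $c \to 0$, the right-hand side can be made smaller than any prescribed fraction of s by choosing c small enough.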


Appendix  B.  Entropic  Mirror  Descent 

The  algorithms  for  the  Kashin-Tzafriri  theorem  and  the  Bourgain-Tzafriri  theorem  both  require 
the  solution  to  a  convex  minimization  problem  over  the  probability  simplex.  It  is  important  to 
have  a  practical  algorithm  for  approaching  these  optimizations.  To  that  end,  we  briefly  describe  a 
simple,  elegant  method  called  Entropic  Mirror  Descent  [BT03].  We  then  explain  how  to  apply  this 
technique  to  the  specific  objective  functions  that  arise  in  our  work. 

B.1. Convex analysis. Let E be a Euclidean space, i.e., a vector space equipped with a real-linear
inner product. Let $\Omega$ be a convex subset of E, and consider a convex function $J : \Omega \to \mathbb{R}$. The
subdifferential $\partial J(f)$ contains each vector $\theta \in E^*$ that satisfies the inequalities

    \[ J(h) - J(f) \ge \langle \theta, h - f \rangle \quad\text{for all } h \in \Omega. \]

The elements of the subdifferential are called subgradients. They describe the directions and rates
of ascent of the function J at the point f. When J is differentiable at f, the gradient is the unique
subgradient.

The Lipschitz constant of the function J with respect to a norm $|||\cdot|||$ is defined to be the least
number L for which

    \[ |J(h) - J(f)| \le L\,|||h - f||| \quad\text{for all } h, f \in \Omega. \]

It can be shown [Roc70, Thm. 24.7] that

    \[ L = \sup\{ |||\theta|||_* : \theta \in \partial J(f),\ f \in \Omega \}, \]

where $|||\cdot|||_*$ is the dual norm.

B.2. Interior subgradient methods. Consider the (nonsmooth) convex program

    \[ \min J(f) \quad\text{subject to}\quad f \in \Omega. \]

Subgradient  information  can  be  used  to  solve  this  problem,  but  caution  is  necessary  because  the 
negative  subgradient  is  not  necessarily  a  direction  of  descent.  As  a  result,  subgradient  methods  are 
typically nonmonotone, which means that the value of the objective function can (and often will)
increase.  It  is  also  common  for  subgradient  methods  to  produce  iterates  outside  the  constraint  set. 
The  classical  remedy  is  to  project  each  iterate  back  onto  the  constraint  set.  This  idea  succeeds, 
but  it  leads  to  zigzagging  phenomena. 

Interior  subgradient  methods  [BT03]  are  designed  to  eliminate  some  of  the  problematic  behavior 
that  projected  subgradient  methods  exhibit.  To  develop  an  interior  subgradient  method,  we  need  a 
divergence  measure  that  is  tailored  to  the  constraint  set.  At  each  iteration,  we  perform  two  steps: 

(1) At the current iterate f, compute a subgradient $\theta \in \partial J(f)$ to linearize the objective
function:

    \[ J(h) \approx J(f) + \langle \theta, h - f \rangle. \]

(2) Penalize the linearization with the divergence $D(\cdot\,; f)$ from the current iterate, scaled by a
(large) parameter $\beta^{-1}$. Minimize this auxiliary function to produce a new iterate $f'$:

    \[ f' \in \arg\min_{h \in \Omega} \bigl\{ J(f) + \langle \theta, h - f \rangle + \beta^{-1} D(h; f) \bigr\}. \]

Algorithm 3: Entropic Mirror Descent

EMD(J, s, T)

Input: Objective function J, dimension s, number T of iterations
Output: Approximate minimizer f of J

1  $f^{(1)} = s^{-1} e$                                        { Initialize with uniform density }
2  for t = 1 to T
3      Find $\theta \in \partial J(f^{(t)})$                     { Compute subgradient }
4      $\beta = \sqrt{2 \log s / (T\,\|\theta\|_\infty^2)}$       { Compute step size }
5      $h = f^{(t)} \cdot \exp(-\beta\theta)$                    { Reweight current iterate }
6      $f^{(t+1)} = h / \mathrm{trace}(h)$                       { Rescale to obtain next iterate }
7  end for
8  Return $f \in \arg\min_t J(f^{(t)})$

The  divergence  penalty  serves  two  purposes.  First,  it  ensures  that  the  next  iterate  is  close  to 
the  previous  iterate,  which  is  essential  because  the  linearization  is  only  useful  locally.  Second,  it 
simultaneously  prevents  the  iterates  from  getting  too  close  to  the  boundary  of  the  constraint  set. 
With a careful choice of the parameter $\beta$, we can guarantee progress toward the optimum set, at
least  on  average. 


B.3. Optimization on the probability simplex. The Entropic Mirror Descent (EMD) algorithm
of Beck and Teboulle [BT03] is a specific instance of the interior subgradient method that is
designed for minimizing convex functions over the probability simplex, the set defined by

    \[ \Delta_s = \{ f \in \mathbb{R}^s : \mathrm{trace}(f) = 1,\ f \ge 0 \}. \]

A natural divergence measure for this set is the relative entropy function:

    \[ D(h; f) = \sum_{i=1}^{s} h_i \log\Bigl( \frac{h_i}{f_i} \Bigr). \]

An amazing feature of the resulting interior subgradient method is that the optimization in the
second step has a closed form:

    \[ f'_j = \frac{f_j \exp(-\beta\theta_j)}{\sum_i f_i \exp(-\beta\theta_i)} \quad\text{for } j = 1, 2, \dots, s. \]
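
The closed form can be recovered with a Lagrange multiplier. The following derivation is a sketch that treats only the equality constraint explicitly (the multiplier $\mu$ is notation introduced here); positivity of the solution is then automatic.

```latex
% Minimize the penalized linearization over the simplex
% (terms that do not depend on h are dropped):
\min_{h}\ \langle \theta, h \rangle + \beta^{-1} \sum_{j=1}^{s} h_j \log\frac{h_j}{f_j}
\quad\text{subject to}\quad \sum_{j=1}^{s} h_j = 1 .

% Stationarity of the Lagrangian in each coordinate h_j:
\theta_j + \beta^{-1}\Bigl( \log\frac{h_j}{f_j} + 1 \Bigr) + \mu = 0
\quad\Longrightarrow\quad
h_j = f_j \exp(-\beta\theta_j)\,\exp(-1 - \beta\mu) .

% The common factor is fixed by the constraint that the h_j sum to one:
h_j = \frac{f_j \exp(-\beta\theta_j)}{\sum_i f_i \exp(-\beta\theta_i)} .
```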


Algorithm  3  describes  the  procedure  that  arises  from  these  choices.  Beck  and  Teboulle  have  estab¬ 
lished  an  elegant  efficiency  estimate  [BT03,  Thm.  4.2]  for  this  method. 


Theorem B.1 (Efficiency of EMD). Let $J : \Delta_s \to \mathbb{R}$ be a convex function whose Lipschitz constant
with respect to the $\ell_1$ norm is L. The approximate minimizer f generated by Algorithm 3 satisfies

    \[ J(f) - J(f_\star) \le L \sqrt{\frac{2 \log s}{T}} \]

where $f_\star$ is a minimizer of J.


Algorithm 3 succeeds with a wide range of step sizes. In particular, when the total number T of
iterations is unknown, we may compute the step size using the current iteration number t:

    \[ \beta = \sqrt{\frac{2 \log s}{t\,\|\theta\|_\infty^2}}. \]

This choice increases the right-hand side of the efficiency estimate by a logarithmic factor.
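
A compact implementation of the method might look as follows. This is a minimal sketch, not the author's code: it assumes the fixed-horizon step size $\beta = \sqrt{2 \log s / (T \|\theta\|_\infty^2)}$, it takes the objective as a hypothetical callable returning both the value and a subgradient, and it shifts the exponent by a constant for numerical stability (the normalization cancels the shift).

```python
import numpy as np

def emd(objective, s, T):
    """Entropic Mirror Descent on the simplex; objective(f) returns (J(f), subgradient)."""
    f = np.full(s, 1.0 / s)                        # uniform starting density
    best_f, best_val = f, np.inf
    for _ in range(T):
        val, theta = objective(f)
        theta = np.asarray(theta, dtype=float)
        if val < best_val:                         # remember the best iterate so far
            best_f, best_val = f, val
        scale = np.max(np.abs(theta))
        if scale == 0.0:                           # zero subgradient: f is a minimizer
            break
        beta = np.sqrt(2.0 * np.log(s) / (T * scale ** 2))
        h = f * np.exp(-beta * (theta - theta.min()))   # multiplicative reweighting
        f = h / h.sum()                            # rescale back onto the simplex
    return best_f, best_val

# Toy example: a linear objective J(f) = <c, f>, minimized at a vertex of the simplex.
c = np.array([1.0, 0.5, 0.2])
linear = lambda f: (float(c @ f), c)
f_hat, val = emd(linear, s=3, T=200)
print(f_hat, val)
```

On this toy problem the returned weight concentrates on the coordinate with the smallest cost, and the best objective value approaches the minimum 0.2.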

B.4. Pietsch factorization via EMD. Suppose B is a matrix with s columns. We can rephrase
the Pietsch factorization problem (3.2) as an optimization over the probability simplex. Define the
linear operator

    \[ \mathrm{diag} : \mathbb{R}^s \to \mathbb{M}^{s \times s} \]

that maps vectors to diagonal matrices in the obvious way. We can write the convex program as

    \[ \min\ \lambda_{\max}\bigl( B^*B - \alpha^2\,\mathrm{diag}(f) \bigr) \quad\text{subject to}\quad f \in \Delta_s. \tag{B.1} \]

Abbreviate the objective function $J : \Delta_s \to \mathbb{R}$. We can evidently apply EMD to complete the
optimization once we find a way to compute subgradients.

We use methods from the convex analysis of Hermitian matrices to determine the subdifferential
of the objective function [Lew96]. Let A be an Hermitian matrix. Then

    \[ \partial\lambda_{\max}(A) = \mathrm{conv}\{ uu^* : Au = \lambda_{\max}(A)\,u,\ \|u\|_2 = 1 \}. \]

In words, the subdifferential of the maximum eigenvalue function at A is the convex hull of all
rank-one projectors whose range lies in the top eigenspace of A. According to [Roc70, Thm. 23.9],
we have

    \[ \partial J(f) = \bigl\{ -\alpha^2\,\mathrm{diag}^*(\Theta) : \Theta \in \partial\lambda_{\max}\bigl( B^*B - \alpha^2\,\mathrm{diag}(f) \bigr) \bigr\}, \]

where the adjoint map $\mathrm{diag}^* : \mathbb{M}^{s \times s} \to \mathbb{R}^s$ extracts the diagonal of a matrix. In particular, we
may construct a subgradient $\theta \in \partial J(f)$ from a normalized maximal eigenvector u of the matrix
$B^*B - \alpha^2\,\mathrm{diag}(f)$ using the formula

    \[ \theta = -\alpha^2\,\mathrm{diag}^*(uu^*) = -\alpha^2\,|u|^2 \]

where $|\cdot|^2$ denotes the componentwise squared magnitude of a vector.

In summary, we can evaluate the objective function J(f) and simultaneously obtain a subgradient
$\theta \in \partial J(f)$ from an eigenvector calculation plus some lower-order operations. Note that the standard
methods for producing a single eigenvector, such as the Lanczos algorithm and its variants [GVL96,
Ch. 9], require access to the matrix only through its action on vectors. It is therefore preferable in
some settings (for example, when B is sparse) not to form the matrix $B^*B$.
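
The evaluation of the objective and a subgradient can be sketched with a dense Hermitian eigensolver. This is an illustration only: it forms $B^*B$ explicitly and calls `numpy.linalg.eigh`, which is convenient for small dense problems but is not the matrix-free approach suggested above; the function name `pietsch_objective` and the test data are hypothetical.

```python
import numpy as np

def pietsch_objective(B, alpha, f):
    """Return J(f) = lambda_max(B*B - alpha^2 diag(f)) and the subgradient -alpha^2 |u|^2."""
    M = B.conj().T @ B - alpha ** 2 * np.diag(f)
    evals, evecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    u = evecs[:, -1]                           # unit-norm top eigenvector
    return evals[-1], -alpha ** 2 * np.abs(u) ** 2

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 4))
alpha = 2.0
f = np.full(4, 0.25)                           # center of the simplex
h = rng.dirichlet(np.ones(4))                  # another point of the simplex
J_f, theta = pietsch_objective(B, alpha, f)
J_h, _ = pietsch_objective(B, alpha, h)

# Subgradient inequality: J(h) - J(f) >= <theta, h - f>.
print(J_h - J_f, theta @ (h - f))
```

Because u has unit norm, the entries of the subgradient are nonpositive and sum to $-\alpha^2$.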

Eigenvector computation is a primitive in every numerical linear algebra package, so it is reasonable
to  assume  that  high-precision  eigenvectors  are  available.  In  any  case,  slight  variants  of  EMD  will 
work  with  approximate  subgradients,  provided  they  are  computed  to  sufficient  precision.  A  simple 
analysis  supporting  this  claim  does  not  seem  to  be  available  in  the  optimization  literature,  but  see 
[Kal07,  Ch.  6]  for  related  work. 

We can bound the Lipschitz constant of J with respect to the $\ell_1$ norm just by considering
subgradients of the form $\theta = -\alpha^2\,|u|^2$ because their convex hull yields the complete subdifferential.
Since the eigenvector u is normalized,

    \[ \|\theta\|_\infty = \alpha^2 \max_j |u_j|^2 \le \alpha^2, \]

we determine that the Lipschitz constant $L \le \alpha^2$. According to Theorem B.1, the EMD algorithm
ostensibly requires $O(\alpha^4)$ iterations to deliver a solution to (B.1) with constant precision. In
practice, far fewer iterations suffice.

Remark  B.2.  The  application  of  EMD  to  (B.l)  closely  resembles  the  multiplicative  weights  method 
[Kal07,  Ch.  6]  for  solving  the  maxcut  problem  (3.4).  Indeed,  the  two  algorithms  are  substantially 
identical,  except  for  the  specific  choice  of  step  sizes  and  the  method  for  constructing  the  final  solution 
from  the  sequence  of  iterates.  The  efficiency  estimates  are  also  similar,  except  that  the  multiplicative 
weights  method  uses  the  widths  of  the  constraints  in  lieu  of  the  Lipschitz  constant.  EMD  appears 
to  be  more  effective  in  practice  because  it  exploits  the  geometry  of  the  problem  more  completely. 

B.5. Grothendieck factorization via EMD. Suppose G is an $s \times s$ Hermitian matrix. The
Grothendieck factorization problem (5.2) can be expressed as solving

    \[ \min\ \lambda_{\max} \begin{bmatrix} -\alpha\,\mathrm{diag}(f) & G \\ G & -\alpha\,\mathrm{diag}(f) \end{bmatrix} \quad\text{subject to}\quad f \in \Delta_s. \]

Abbreviate the objective function $J : \Delta_s \to \mathbb{R}$. Once again, EMD is an appropriate technique.

We may obtain subgradients using the same methods as before. Compute a normalized, maximal
eigenvector of the matrix:

    \[ \begin{bmatrix} -\alpha\,\mathrm{diag}(f) & G \\ G & -\alpha\,\mathrm{diag}(f) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = J(f) \begin{bmatrix} u \\ v \end{bmatrix} \quad\text{where}\quad \|u\|_2^2 + \|v\|_2^2 = 1. \]

Then a subgradient $\theta \in \partial J(f)$ can be obtained from the formula

    \[ \theta = -\alpha \bigl( |u|^2 + |v|^2 \bigr). \]

The Lipschitz constant $L \le \alpha$, so the number of iterations of EMD is apparently $O(\alpha^2)$. Of course,
the eigenvector calculations can be streamlined by exploiting the structure of the matrix.
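
The same computation for the Grothendieck objective can be sketched as follows. This is an illustration under stated assumptions: the subgradient is taken to be $-\alpha(|u|^2 + |v|^2)$, which follows from the same adjoint computation as in Section B.4 applied to both diagonal blocks; the block matrix is formed densely without exploiting its structure; and the function name `grothendieck_objective` and the test data are hypothetical.

```python
import numpy as np

def grothendieck_objective(G, alpha, f):
    """J(f) = lambda_max of [[-alpha diag(f), G], [G, -alpha diag(f)]] with a subgradient."""
    s = G.shape[0]
    D = alpha * np.diag(f)
    M = np.block([[-D, G], [G, -D]])           # Hermitian because G is Hermitian
    evals, evecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    w = evecs[:, -1]                           # unit-norm top eigenvector, w = (u, v)
    u, v = w[:s], w[s:]
    theta = -alpha * (np.abs(u) ** 2 + np.abs(v) ** 2)
    return evals[-1], theta

rng = np.random.default_rng(0)
S = rng.standard_normal((4, 4))
G = (S + S.T) / 2                              # real symmetric test matrix
alpha = 1.5
f = np.full(4, 0.25)
h = rng.dirichlet(np.ones(4))
J_f, theta = grothendieck_objective(G, alpha, f)
J_h, _ = grothendieck_objective(G, alpha, h)

# Subgradient inequality: J(h) - J(f) >= <theta, h - f>.
print(J_h - J_f, theta @ (h - f))
```

The largest entry of the subgradient is at most $\alpha$ in magnitude, consistent with the Lipschitz bound stated above.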